Outline:

  1. ADC semantic annotation overview/goals
  2. Summary of annotation efforts to date
  3. Summary of most commonly used semantic annotations
  4. Summary of attributes that are not being annotated

1. Overview

To improve data discoverability within the Arctic Data Center (ADC), the datateam is beginning to add semantic annotations as part of the data curation process. Attaching terms from controlled vocabularies offers a way to standardize the diverse descriptions of data used by researchers across disciplines. Semantic annotations not only provide definitions of concepts but also show the relationships between different terms.

Dr. Steven Chong led the first major effort to implement semantic search within the ADC, beginning in 2017, by building out ontological terms pertaining to carbon cycling. You can read more about Steven’s efforts, the ADC’s semantic search product, and its vision moving forward in this blog post. More recently (as of about August 1, 2020), the datateam began making a second push to add semantic annotations to attributes for all incoming data packages to the ADC.

The ADC datateam is currently instructed to add annotations from four main ontologies (the following text was borrowed from the NCEAS Datateam Training):

Here, I explore the current ADC corpus to summarize our progress in implementing semantic search and identify areas for improvement/further consideration.


2. Summary of annotation efforts

How many data packages have annotations?

  • As of October 12, 2020, the Arctic Data Center contains 6142 data packages (NOTE: a data package consists of a publicly available metadata record, which may be packaged with one or more data files). Of those, 1428 contain data file types that have associated attributes (i.e., variables). Currently, 185 data packages have at least one semantically annotated attribute.

How many attributes have been annotated, and when were these added?

  • The majority of attributes in those 185 data packages are annotated (12312/14718). Most of these annotations were added during Dr. Chong’s tenure at the ADC (9802/12312), compared with the 2510/12312 added by the datateam since August 2020.

Which ontology(ies) are the majority of annotations coming from?

  • The vast majority of semantic annotations come from the Ecosystem Ontology (ECSO) (12155/12312, or 98.7%). The remainder come from CHEBI, ENVO, and Wikipedia. See details in the table below.
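The ratios quoted in this section can be turned into percentages with a few lines of Python. This is only a minimal sketch for checking the arithmetic: the counts are copied from the text above, while the variable names and the `pct` helper are my own and are not part of any ADC tooling.

```python
# Counts as reported in this summary (as of October 12, 2020).
total_packages = 6142             # all ADC data packages
packages_with_attributes = 1428   # packages whose files have attributes
packages_with_annotations = 185   # packages with >= 1 annotated attribute
total_attributes = 14718          # attributes in those 185 packages
annotated_attributes = 12312      # attributes carrying an annotation
annotated_before_aug_2020 = 9802  # added during Dr. Chong's tenure
annotated_since_aug_2020 = 2510   # added by the datateam since August 2020
ecso_annotations = 12155          # annotations drawn from ECSO


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The two time periods should account for every annotated attribute.
assert annotated_before_aug_2020 + annotated_since_aug_2020 == annotated_attributes

shares = {
    "packages with attributes (of all packages)": pct(packages_with_attributes, total_packages),
    "annotated packages (of packages with attributes)": pct(packages_with_annotations, packages_with_attributes),
    "annotated attributes (within annotated packages)": pct(annotated_attributes, total_attributes),
    "annotations added before August 2020": pct(annotated_before_aug_2020, annotated_attributes),
    "annotations added since August 2020": pct(annotated_since_aug_2020, annotated_attributes),
    "annotations from ECSO": pct(ecso_annotations, annotated_attributes),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

Expressing each count as a share of its own denominator makes it easier to see that annotation coverage is high within the 185 annotated packages but low across the packages that could be annotated.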

Which annotations are non-resolvable?

See additional details regarding these three non-resolvable URIs in Table 2, below:


3. Which semantic annotations are most commonly used at the attribute level?

The most common semantic annotations used across all ADC metadata records are visualized in Figure 1, below (for the sake of space, only terms used more than 20 times are included in Figure 1). These include terms such as soil temperature (used a total of 439 times), relative species abundance (used 305 times), and air temperature (used 283 times). You can explore (and download) the associated data file containing all semantic annotations (not just those used >20 times) currently included in ADC metadata records in Table 1. Dropping the valueURI into your web browser will take you to the semantic annotation, where you can learn more about its description and relationship to other terms.
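The "used more than 20 times" cutoff amounts to a frequency tally followed by a threshold filter. The sketch below illustrates that idea only; it is not the code used to produce the figure. The `common_terms` function and its `min_uses` parameter are hypothetical names, and the input list is toy data rather than the real ADC annotation counts.

```python
from collections import Counter


def common_terms(labels, min_uses=20):
    """Tally annotation labels and keep those used more than `min_uses`
    times, ordered from most to least frequent."""
    counts = Counter(labels)
    return [(term, n) for term, n in counts.most_common() if n > min_uses]


# Toy input (NOT real ADC data): one label per annotated attribute.
toy_labels = (
    ["soil temperature"] * 5
    + ["air temperature"] * 3
    + ["relative species abundance"] * 1
)

# A small threshold is used here so that the toy data shows the filter at work.
print(common_terms(toy_labels, min_uses=2))
# prints [('soil temperature', 5), ('air temperature', 3)]
```

With the full annotation table as input and the default `min_uses=20`, the same filter would keep only terms above the cutoff described above.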

NOTICE THAT SOME ANNOTATIONS GOT MORE ATTENTION BEFORE/AFTER AUG…..

NOTE: You can download Table 3 as a .csv file here.


4. Which attributes are not getting annotated? Why?